Techniques for facilitating identification of candidate genes

ABSTRACT

Techniques for facilitating the identification of candidate genes from a plurality of DNA sequences. According to an embodiment of the present invention, techniques are provided for extracting and integrating information from various information sources and results of various analyses, and storing the integrated information in a form which is conducive to identification of candidate genes. The stored information may include results of a homology search for the plurality of DNA sequences, annotative information for the plurality of DNA sequences indicating the biochemical functions and physiological roles of the plurality of DNA sequences, gene expression profile data for the plurality of DNA sequences describing behavioral patterns of the plurality of DNA sequences, results from clustering the plurality of DNA sequences based on time course data as described by the gene expression profile data, and other information.

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] This application also claims priority from and is a continuation-in-part application of non-provisional U.S. patent application Ser. No. 09/365,587, entitled “SYSTEM AND METHOD FOR IDENTIFYING CRITICAL REGULATED GENES” filed Jul. 30, 1999, the entire contents of which are herein incorporated by reference in their entirety for all purposes.

COPYRIGHT NOTICE

[0002] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the xerographic reproduction by anyone of the patent document or the patent disclosure in exactly the form it appears in the U.S. Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

[0003] The present invention relates generally to the field of bioinformatics, and more particularly to techniques for facilitating the identification of candidate genes.

[0004] With recent advances in the identification of expressed sequence tags (ESTs) and the sequencing of the human genome, a number of researchers are now directing their efforts towards analyzing the data from the genome maps and sequences. A significant portion of this research is being directed towards identifying genes which might trigger, prevent, ameliorate, or somehow affect a variety of diseases or physiological states. Such genes are commonly referred to as “candidate” genes.

[0005] The identification of candidate genes is critical to entities such as drug companies who may use the information related to the candidate genes to identify better drug targets in the drug development process. The early identification of candidate genes could reduce the number of potential therapeutics moving through a company's clinical testing pipeline, significantly reducing overall costs and reducing the time taken by the company to market the drugs.

[0006] However, conventional techniques do not facilitate easy identification of candidate genes. This is due to the enormous amount of information being generated by the researchers, and the lack of adequate tools to organize the information in a manner which facilitates analysis of the information. For example, techniques such as parallel expression and analysis using cDNA arrays, as described in U.S. Pat. No. 5,807,522, and synthetic DNA array technology, as described in U.S. Pat. Nos. 5,593,839 and 5,571,639, have been developed to study large scale gene expression profiles (e.g. time-courses of a disease process or comparisons between an altered physiologic or metabolic state and an untreated biological sample). Databases and algorithms have also been developed to analyze the results of the above-mentioned array technologies. Public databases of metabolic, genetic and physiological pathways of yeast (e.g., Munich Information Center for Protein Sequences (MIPS)) and some mammalian genes (e.g., Kyoto Encyclopedia of Genes and Genomes (KEGG)) have been developed largely from the published literature of many traditional low-throughput experimental studies. However, the information provided by the various sources of information identified above and other sources has not been integrated in a coherent manner conducive to identification of candidate genes.

[0007] Based on the foregoing, there is a need for techniques which can facilitate the identification of candidate genes. It is desirable that these techniques be able to correlate various types of information and store it in a format which can be easily accessed or queried by researchers interested in identifying candidate genes.

SUMMARY OF THE INVENTION

[0008] The present invention discusses techniques for facilitating identification of candidate genes from a plurality of DNA sequences. According to an aspect of the present invention, techniques are provided for extracting and integrating information from various information sources and results of various analyses, and storing the integrated information in a form which facilitates identification of candidate genes.

[0009] According to an embodiment, the present invention accesses results of a homology search for the plurality of DNA sequences, annotative information for the plurality of DNA sequences indicating the biochemical functions and physiological roles of the plurality of DNA sequences, gene expression profile data for the plurality of DNA sequences describing behavioral patterns of the plurality of DNA sequences, results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data, and other information. The information accessed by the present invention is stored in a format, e.g. a database, which facilitates identification of candidate genes.

[0010] According to another embodiment, the present invention receives queries identifying criteria for the candidate genes. In response to the queries, the present invention searches the database storing information for the plurality of DNA sequences to identify a set of DNA sequences which satisfy the query criteria. The set of DNA sequences is then output as a result of the query.

[0011] According to yet another embodiment of the present invention, a user may configure a query identifying criteria for the candidate genes and communicate the query to a server storing information related to a plurality of DNA sequences. According to this embodiment, the information related to the plurality of DNA sequences may comprise results of a homology search for the plurality of DNA sequences, annotative information for the plurality of DNA sequences describing the biochemical functions and physiological roles of the plurality of DNA sequences, gene expression profile data for the plurality of DNA sequences describing behavioral patterns of the plurality of DNA sequences, results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data, and other information. In response to the query, the user receives a first set of DNA sequences which satisfy the criteria for the candidate genes identified in the query.

[0012] The invention will be better understood by reference to the following detailed description and the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] FIG. 1 is a simplified block diagram of a distributed computer network incorporating an embodiment of the present invention;

[0014] FIG. 2 is a simplified block diagram of a computer system according to an embodiment of the present invention;

[0015] FIG. 3 is a simplified flowchart showing processing performed by an embodiment of the present invention to facilitate identification of candidate genes from a plurality of input DNA sequences;

[0016] FIG. 4 depicts a process of performing homology analysis for a plurality of sequences according to an embodiment of the present invention;

[0017] FIG. 5 depicts a database schema showing information extracted from homology search results and stored for the query cDNA sequences according to an embodiment of the present invention;

[0018] FIG. 6 is a simplified flowchart showing processing performed by an embodiment of the present invention for obtaining descriptive annotative information for the genes;

[0019] FIG. 7 depicts a database schema showing the functional annotative information stored for the genes according to an embodiment of the present invention;

[0020] FIG. 8 depicts a database schema showing the gene expression profile data stored for the genes according to an embodiment of the present invention; and

[0021] FIG. 9 is an exemplary look-up table for general rankings of biomedical journals.

DESCRIPTION OF THE SPECIFIC EMBODIMENTS

[0022] The present invention discusses techniques for facilitating identification of candidate genes from a plurality of DNA sequences. According to an aspect of the present invention, techniques are provided for extracting and integrating information from various information sources and results of various analyses, and storing the integrated information in a form which facilitates identification of candidate genes.

[0023] As part of the analysis, an embodiment of the present invention analyzes and extracts information from homology searches performed on a plurality of DNA sequences. According to another aspect, the present invention extracts descriptive annotative information from various information stores about cDNA clones, which have been isolated on the basis of differential expression from various disease models or altered physiological states. According to another embodiment, the present invention extracts information about causally ordered (i.e. as defined by autoregression-based causality analysis) behavioral patterns of differentially expressed cDNAs from gene expression profile data. According to another embodiment, the present invention correlates the descriptive annotative information about cDNA clones with numerical experimental data on the behavior of the cDNAs extracted from, for example, the gene expression profiling data. According to another embodiment, the present invention integrates the information to provide a model to facilitate experimental testing of the candidate genes. The information extracted/obtained by the present invention is stored in a database. According to an embodiment of the present invention, users may query the information stored in the database to identify candidate genes.

[0024] FIG. 1 is a simplified block diagram of a distributed computer network 10 incorporating an embodiment of the present invention. Computer network 10 includes a number of client systems 16-1, 16-2, and 16-3, and a server system 14 coupled to a communication network 12 via a plurality of communication links 18. Communication network 12 provides a mechanism for allowing the various components of distributed network 10 to communicate and exchange information with each other. Communication network 12 may itself be comprised of many interconnected computer systems and communication links. Communication links 18 may be hardwire links, optical links, satellite or other wireless communications links, wave propagation links, or any other mechanisms for communication of information. While in one embodiment, communication network 12 is the Internet, in other embodiments, communication network 12 may be any suitable computer network. Distributed computer network 10 depicted in FIG. 1 is merely illustrative of an embodiment incorporating the present invention and does not limit the scope of the invention as recited in the claims. One of ordinary skill in the art would recognize other variations, modifications, and alternatives. For example, more than one server system 14 may be coupled to communication network 12.

[0025] Client systems 16 typically request information from a server computer system which provides the information. For this reason, servers typically have more computing and storage capacity than client systems. However, a particular computer system may act as either a client or a server depending on whether the computer system is requesting or providing information. Additionally, although the invention has been described using a client-server environment, it should be apparent that the invention may also be embodied in a stand-alone computer system.

[0026] According to the teachings of the present invention, server system 14 is responsible for obtaining and storing information for a plurality of DNA sequences in order to facilitate identification of candidate genes from the DNA sequences. Server system 14 may store the information in one or more databases accessible to server 14. These databases may be locally coupled to server 14 or may be distributed across distributed computer network 10 and accessed by server 14 via communication network 12.

[0027] Software modules executing on server system 14 are responsible for obtaining information from a plurality of information sources, and integrating and storing the information in a manner which facilitates identification of candidate genes. The information sources may include databases accessible to server system 14, results from various analyses, published sources of information such as magazine articles, etc., and other like information sources. Server 14 also provides services allowing users to select, access, retrieve, or query information stored by the server.

[0028] Server 14 is responsible for receiving information requests from client systems 16, performing processing required to satisfy the requests, and for forwarding the results corresponding to the requests back to the requesting client system. The processing required to satisfy the request may be performed by server 14 or may alternatively be delegated to other servers connected to communication network 12.

[0029] According to the teachings of the present invention, client systems 16 enable users to access and query information stored by server system 14. In a specific embodiment, a “web browser” application executing on a client system enables users to select, access, retrieve, or query information stored by server system 14. Examples of web browsers include the Internet Explorer browser program provided by Microsoft Corporation, the Netscape Navigator browser provided by Netscape Communications Corporation, and others.

[0030] FIG. 2 is a simplified block diagram of computer system 20 according to an embodiment of the present invention. Computer system 20 typically includes at least one processor 24 which communicates with a number of peripheral devices via bus subsystem 22. These peripheral devices typically include a storage subsystem 32, comprising a memory subsystem 34 and a file storage subsystem 40, user interface input devices 30, user interface output devices 28, and a network interface subsystem 26. The input and output devices allow user interaction with computer system 20. It should be apparent that the user may be a human user, a device, another computer, and the like. Network interface subsystem 26 provides an interface to outside networks, including an interface to communication network 12, and is coupled via communication network 12 to corresponding interface devices in other computer systems.

[0031] User interface input devices 30 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 20 or onto computer network 12.

[0032] User interface output devices 28 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 20 to a human or to another machine or computer system.

[0033] Storage subsystem 32 stores the basic programming and data constructs that provide the functionality of the various systems embodying the present invention. For example, the various modules implementing the functionality of the present invention may be stored in storage subsystem 32. These software modules are generally executed by processor 24. In a distributed environment, the software modules may be stored on a plurality of computer systems and executed by processors of the plurality of computer systems. Storage subsystem 32 also provides a repository for storing the various databases storing information according to the present invention. Storage subsystem 32 typically comprises memory subsystem 34 and file storage subsystem 40.

[0034] Memory subsystem 34 typically includes a number of memories including a main random access memory (RAM) 38 for storage of instructions and data during program execution and a read only memory (ROM) 36 in which fixed instructions are stored. File storage subsystem 40 provides persistent (non-volatile) storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a Compact Disc Read Only Memory (CD-ROM) drive, an optical drive, removable media cartridges, and other like storage media. One or more of the drives may be located at remote locations on other connected computers at another site on communication network 12. Information stored according to the teachings of the present invention may also be stored by file storage subsystem 40.

[0035] Bus subsystem 22 provides a mechanism for letting the various components and subsystems of computer system 20 communicate with each other as intended. The various subsystems and components of computer system 20 need not be at the same physical location but may be distributed at various locations within distributed network 10. Although bus subsystem 22 is shown schematically as a single bus, alternate embodiments of the bus subsystem may utilize multiple busses.

[0036] Computer system 20 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, or any other data processing system. Due to the ever-changing nature of computers and networks, the description of computer system 20 depicted in FIG. 2 is intended only as a specific example for purposes of illustrating the preferred embodiment of the present invention. Many other configurations of a computer system are possible having more or fewer components than the computer system depicted in FIG. 2. Client computer systems 16 and server computer systems 14 generally have the same configuration as shown in FIG. 2, with the server systems generally having more storage capacity and computing power than the client systems.

[0037] FIG. 3 depicts a simplified flowchart 50 showing processing performed by an embodiment of the present invention to facilitate identification of candidate genes from a plurality of input DNA sequences. As shown in FIG. 3, processing is initiated when the server system 14 accesses results of a homology search from the plurality of input DNA sequences (step 52).

[0038] The DNA sequences which are input as queries to the homology search are generally complementary DNA (cDNA) sequences which have been synthesized using isolated messenger RNA (mRNA) sequences, which are the transcription products of expressed genes. The cDNA sequences are used as input sequences to the homology search analysis since cDNAs represent expressed genomic regions and are thus believed to identify parts of the genome with the most biological and medical significance.

[0039] As part of the homology search, DNA and protein sequence databases are searched to find sequences which are related to the input or query DNA sequences. For example, given a set of differentially expressed query cDNA sequences corresponding to the mRNA of their cognate genes, a homology search identifies known, similar and unknown genes. A homology search is generally performed by using computer-implemented search algorithms to compare the query cDNA sequences with sequence information stored in a plurality of databases accessible via a communication network, for example, the Internet. Examples of such algorithms include the Basic Local Alignment Search Tool (BLAST) algorithm, the PSI-BLAST algorithm, the Smith-Waterman algorithm, the Hidden Markov Model (HMM) algorithm, and other like algorithms. For example, a “blastn” program utilizing the BLAST algorithm may be used to search the GenBank database for homologs of the query cDNA sequences. According to an embodiment of the homology search, the query cDNA sequences may be grouped as “known,” “unknown,” or “similar” sequences. “Known” cDNA sequences include sequences with substantial sequence identity to existing sequence entries in a sequence database, such as the GenBank database. “Unknown” cDNA sequences include sequences similar to existing sequence entries in a sequence database but lacking functional annotation, or those sequences with no matching sequences in existing sequence databases. “Similar” cDNA sequences include sequences for which no matches are found in the sequence database, but which exhibit similarity, as defined below, to existing entries in sequence databases.

[0040] Two or more sequences may exhibit “substantial sequence identity” if the sequences have at least 70%, preferably 80%, most preferably 90%, 95%, 98% or 99% nucleotide or amino acid residue identity, when compared and aligned for maximum correspondence, as measured using a particular sequence comparison algorithm or by using visual inspection.
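By way of illustration only, the percent identity measure referred to above can be sketched in Python as follows. The function names are hypothetical, the inputs are assumed to be two already-aligned sequences of equal length with "-" marking gaps, and columns containing a gap are assumed to be excluded from the count. Other gap-handling conventions exist and would give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over aligned columns that contain no gap.

    Assumption: both strings come from the same pairwise alignment, so they
    have equal length and use '-' as the gap character.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are skipped under this convention
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def has_substantial_identity(aligned_a: str, aligned_b: str,
                             threshold: float = 70.0) -> bool:
    """Apply the 70% floor; stricter floors (80, 90, ...) can be passed in."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Passing a threshold of 80 or 90 corresponds to the more preferred identity levels recited above.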

[0041] Several different sequence comparison techniques may be used.

[0042] According to a first technique, two sequences (amino acid or nucleotide) can be compared over their full-length (e.g. the length of the shorter of the two, if they are of substantially different lengths) or over sub-sequences of at least about 200, about 500 or about 1000 contiguous nucleotides or at least about 40, about 50, or about 100 contiguous amino acid residues. According to an embodiment of the present invention, a query cDNA sequence may be qualified as a “known” gene if the query DNA sequence meets the following stringent criteria: (1) a sequence length greater than 200 nucleotides with greater than or equal to 80% identity over 70% of the query sequence length with an E-value (a probability value of a match occurring if the sequence were randomized) of less than 1e-50; and (2) for the predicted amino acid homology, greater than or equal to 80% identity for a segment length greater than 50 amino acids and an E-value of less than 1e-20. Sequences that meet either, but not both, the DNA or protein sequence criteria may be grouped as “similar” genes after examination of the respective DNA or protein alignments.
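The two-part test for a “known” gene can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the class and function names are hypothetical, “80% identity over 70% of the query sequence length” is assumed to mean that the aligned region covers at least 70% of the query, and the manual examination of alignments that precedes a “similar” call is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Best hit for one query (hypothetical record layout)."""
    query_length: int        # nucleotides for DNA hits, residues for protein hits
    percent_identity: float  # identity over the aligned region, 0-100
    aligned_length: int      # length of the aligned region
    e_value: float


def meets_dna_criteria(hit: Hit) -> bool:
    # Criterion (1): length > 200 nt, >= 80% identity over 70% of the query, E < 1e-50.
    return (hit.query_length > 200
            and hit.percent_identity >= 80.0
            and hit.aligned_length >= 0.70 * hit.query_length
            and hit.e_value < 1e-50)


def meets_protein_criteria(hit: Hit) -> bool:
    # Criterion (2): >= 80% identity over a segment > 50 residues, E < 1e-20.
    return (hit.percent_identity >= 80.0
            and hit.aligned_length > 50
            and hit.e_value < 1e-20)


def classify(dna_hit: Optional[Hit], protein_hit: Optional[Hit]) -> str:
    dna_ok = dna_hit is not None and meets_dna_criteria(dna_hit)
    protein_ok = protein_hit is not None and meets_protein_criteria(protein_hit)
    if dna_ok and protein_ok:
        return "known"
    if dna_ok or protein_ok:
        return "similar"  # in practice only after the alignments are examined
    return "unclassified"
```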

[0043] For sequence comparison, typically one sequence acts as a reference sequence, to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are input to a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated program parameters.

[0044] As stated above, a plurality of homology search algorithms may be used to determine optimal alignment of sequences. These include the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol., 48:443 (1970), the similarity method of Pearson & Lipman, Proc. Natl. Acad. Sci. USA 85:2444 (1988), the PSI-BLAST homology algorithm of Altschul et al., Nucleic Acids Res. 25:3389-402 (1997), the computerized implementations of algorithms GAP, BESTFIT, FASTA, and TFASTA included in the Wisconsin Genetics Software Package (Genetics Computer Group, 575 Science Dr., Madison, Wis.), Hidden Markov Models (HMM; Durbin, Eddy, Krogh & Mitchison, Cambridge University Press, 1998), EMotif/EMatrix to identify sequence motifs (Nevill-Manning, Wu, & Brutlag, Proc. Natl. Acad. Sci. USA 1998 May 26;95(11):5865-71), and visual inspection (see generally Ausubel et al., supra). Each of the above identified algorithms and references is herein incorporated by reference in its entirety for all purposes. These algorithms are well known to one of ordinary skill in the art of molecular biology and bioinformatics. When using any of the aforementioned algorithms, the default parameters for “Window”, gap penalty, etc., are used. Practitioners of the art of molecular biology with average skill will recognize these parameters as: (a) the “window” is typically a 9, 10 or 11 nucleotide word length of sequence over which the homology is determined; and (b) gap penalty is a scoring value to prevent large gaps from occurring in reported alignments.

[0045] The BLAST algorithm is well suited for determining percent sequence identity and sequence similarity. The BLAST algorithm is described in Altschul et al., J. Mol. Biol. 215:403-410 (1990), the entire contents of which are herein incorporated by reference for all purposes. Several software programs incorporating the BLAST algorithm are publicly available through the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/). These programs include the blastp, blastn, blastx, tblastn, tblastx, and PSI-blast software programs. Due to codon wobble or species differences, more informative homologies can be found by comparing the predicted protein sequence of a DNA query sequence to a protein sequence database. For this task, the Smith-Waterman or PSI-BLAST algorithms may be used. Similarly, for weak homologs, functional domains of proteins may be discerned by Smith-Waterman, HMM or EMotif algorithms. Software for performing HMM and Smith-Waterman analysis can be obtained from a variety of public sources (e.g. http://hmmer.wustl.edu/; http://www.stanford.edu/~sntaylor/bioc218/final.htm#Appendix) and/or from vendors that sell accelerated computer hardware to rapidly process large batches of sequences (e.g. Paracel, Pasadena, Calif. or Time-Logic, Reno, Nev.). Software for EMotif/EMatrix can be obtained from sources such as the Brutlag Bioinformatics Group, Stanford University, Stanford, Calif.

[0046] The BLAST heuristic search algorithm is optimized for speed and searches sequence databases accessible to server 14 for optimal local alignments to the input query DNA sequences. Databases which may be searched using the BLAST programs include the SWISS-PROT protein sequence database, GenBank database, the Genome Sequence database (GSDB), the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence database, the DNA Database of Japan (DDBJ), and other like databases.

[0047] The BLAST algorithm identifies high scoring sequence pairs (HSPs) by identifying short words of length “W” in the query cDNA sequence, which either match or satisfy some positive-value threshold score “T” when aligned with a word of the same length in a database sequence. “T” is referred to as the neighborhood word score threshold (Altschul et al., supra). An “X” parameter is a positive integer representing the maximum permissible decay of the cumulative segment score during word hit extension. These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Extension of the word hits in each direction is halted when the cumulative alignment score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments, or when the end of either sequence is reached. The BLAST algorithm parameters “W”, “T”, and “X” determine the sensitivity and speed of the alignment. Accordingly, the stringency of a BLAST search can be adjusted by appropriately setting the search parameters. However, if the search parameters are too loose, an excessive number of biologically questionable “hits” may be returned. The BLAST program uses as defaults a wordlength (W) of 11, the BLOSUM62 scoring matrix (see Henikoff & Henikoff, Proc. Natl. Acad. Sci. USA 89:10915 (1989)), alignments (B) of 50, expectation (E) of 10, M=5, N=-4, and a comparison of both strands. Typically, the default parameters can yield from zero to scores of likely homologs for the input query DNA sequences.
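The seed-and-extend idea described above can be sketched in simplified form. The following is not the BLAST program itself: it considers only exact word matches (no neighborhood words scored against “T”), extends in one direction only, and performs no gapped alignment. The function names are hypothetical; the match reward of 5 and mismatch penalty of -4 mirror the M and N defaults mentioned above, and the default value for the X parameter is an arbitrary assumption.

```python
from typing import Dict, List, Tuple


def find_seeds(query: str, subject: str, w: int) -> List[Tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index: Dict[str, List[int]] = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def extend_right(query: str, subject: str, qi: int, sj: int,
                 match: int = 5, mismatch: int = -4,
                 x_drop: int = 20) -> Tuple[int, int]:
    """Ungapped rightward extension from (qi, sj).

    Stops at the end of either sequence, when the running score reaches zero
    or below, or when it has decayed more than x_drop below the best score
    seen so far. Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = 0
    k = 0
    while qi + k < len(query) and sj + k < len(subject):
        score += match if query[qi + k] == subject[sj + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        if score <= 0 or best - score > x_drop:
            break
    return best, best_len
```

A leftward extension would mirror `extend_right`, and the two results together would delimit one ungapped HSP around the seed.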

[0048] In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g. Karlin & Altschul, Proc. Natl. Acad. Sci. USA 90:5873-5877 (1993)). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N) or E-value as an expected value), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.01, more preferably less than about 0.001, and most preferably less than about 0.0001.

[0049] A further indication that two nucleic acid sequences or polypeptides are substantially identical is that the polypeptide encoded by the first nucleic acid is immunologically cross reactive with the polypeptide encoded by the second nucleic acid. Thus, a polypeptide is typically substantially identical to a second polypeptide, for example, where the two peptides differ only by conservative substitutions. These polypeptide sequence comparisons are enabled by the Smith-Waterman, HMM and EMotif algorithms.

[0050] As is well known to one of ordinary skill in the art, results from a homology search or analysis include: a plurality of cDNA query sequences; a list of homologous (target) sequences; an E-value that describes the probability that the match of the original (query) sequence with the target sequence could occur randomly; the annotation of the target sequence, if provided; an alignment of the query sequence to each target sequence; the percent identity of the query sequence to the target sequence; and the hit length, or length of the sequence over which the percent identity is determined.
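As a rough illustration of how such results might be held in memory, the sketch below defines a hypothetical record type and fills it from one line of tab-separated search output. The column order assumed here (query id, target id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score) is that of the 12-column tabular format produced by current NCBI BLAST+ releases; it is an assumption and not a format named in this document. The annotation and the alignment itself are not part of that format and would have to be obtained separately.

```python
from dataclasses import dataclass


@dataclass
class HomologyHit:
    """One query/target pair from a homology search (hypothetical layout)."""
    query_id: str
    target_id: str
    percent_identity: float
    hit_length: int        # length over which percent identity is determined
    e_value: float
    annotation: str = ""   # filled in later from the target's database entry


def parse_tabular_line(line: str) -> HomologyHit:
    """Parse one line of 12-column, tab-separated search output."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
    return HomologyHit(
        query_id=fields[0],
        target_id=fields[1],
        percent_identity=float(fields[2]),
        hit_length=int(fields[3]),
        e_value=float(fields[10]),
    )
```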

[0051] The complete homology analysis of a plurality of sequences according to an embodiment of the present invention follows the process depicted in FIG. 4. The output(s) from the process shown in FIG. 4 may be used as the input to step 52 in FIG. 3. The rationale for this sequential strategy of homology analysis is to automate the method of sequence classification. According to the embodiment shown in FIG. 4, input sequences 80 are subjected to BLAST analysis 82 against an internal database of cDNA sequences 84. Near identical homologs (E-value < 1e-80) are sieved and recorded as being strong homologs of previously classified entries 86 of the internal database. Those sequences failing this test are subjected to blastn analysis 88 against the GenBank nucleotide (NT) and patent databases 90. Those sequences showing strong similarity (E-value < 1e-50 with sequence length > 200 nucleotides, 80% identity over 70% of the query sequence length) are classified as “known” genes 92. Those sequences failing this test are subjected to Smith-Waterman analysis 94 against the protein databases of Swiss-Prot and the translated patent database 96. Those sequences with E-values < 1e-20 with 80% identity over a segment length > 50 amino acids are classified as “known” genes 98, while sequences with an E-value > 1e-20 are subjected in parallel to (a) HMM 102 and EMotif 100 analysis against the Swiss-Prot and GenBank non-redundant (NR) protein databases 104 and (b) BLASTN analysis 106 against the GenBank EST and genomic databases 108. Those sequences with an E-value < 1e-9 after HMM or EMotif are scored as “Similar” genes 110, while sequences with an E-value < 1e-60 after the final BLASTN analysis 106 are classified as “unknown” 112. Any sequences failing this last test are classified as “Novel” 114.
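The sequential sieve of FIG. 4 can be summarized as a decision cascade. In the sketch below, the searches themselves are assumed to have been run already and their outcomes collected in a hypothetical `StageResults` record; only the order of the tests and the E-value cutoffs come from the description above. Where the two parallel branches both succeed, the sketch assumes that the “similar” call takes precedence, a point the description leaves open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageResults:
    """Precomputed outcomes for one input sequence (hypothetical layout)."""
    internal_e: Optional[float] = None     # best E-value vs. internal cDNA database
    genbank_known: bool = False            # blastn hit met the nucleotide "known" test
    protein_known: bool = False            # Smith-Waterman hit met the protein "known" test
    motif_e: Optional[float] = None        # best E-value from HMM or EMotif
    est_genomic_e: Optional[float] = None  # best E-value vs. EST and genomic databases


def below(e_value: Optional[float], cutoff: float) -> bool:
    """True when a hit exists and its E-value is under the cutoff."""
    return e_value is not None and e_value < cutoff


def classify_sequence(r: StageResults) -> str:
    if below(r.internal_e, 1e-80):
        return "previously classified"
    if r.genbank_known or r.protein_known:
        return "known"
    if below(r.motif_e, 1e-9):
        return "similar"  # assumed to take precedence over "unknown"
    if below(r.est_genomic_e, 1e-60):
        return "unknown"
    return "novel"
```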

[0052] The present invention extracts relevant information from the homology analysis output as described above for each input DNA sequence, organizes the information, and stores it in a format which facilitates further processing and analysis of the information (step 54). According to an embodiment of the present invention, the information extracted from the BLAST, Smith-Waterman and HMM search output is stored in a database. The information extracted and stored by the present invention during step 54 is shown by the database schema depicted in FIG. 5. FIGS. 7 and 8 depict other database structures for storing information according to an embodiment of the present invention.

[0053] FIG. 5 shows information (database table “HomologyResults” 120) which is extracted from the homology search results, and stored for each query cDNA sequence according to an embodiment of the present invention. It is important to note that multiple (typically 10) homologs for each query sequence are stored in this database table in order to facilitate extraction of the most descriptive and accurate annotation for the query sequence. It should also be evident that various other formats, in addition to tables and databases, may also be used to store the information. The following scenario is common: the top 1, 2, 3, 4 or 5 blastn homologs of a query have E-values within a 10-fold range and are <1e-50 yet lack informative annotative information (e.g. such homologs are expressed sequence tags or genomic DNA). However, the second, third, fourth, fifth, sixth or seventh homolog's E-value might have the following attributes: the E-value is less than 1e-50 and is within 10 or 100 fold of the top hit, but the weaker homolog's annotation might provide a more informative description of the query sequence's role or function; e.g. the weaker homolog might be an enzyme, receptor or structural protein. Identification of these more accurate descriptions is facilitated by a combination of keyword tables and information extraction methods described herein. In these circumstances, those of normal skill in the art of bioinformatics will recognize that the weaker hit provides the most useful annotation provided the E-value meets the above criteria.
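The selection rule described in this scenario can be sketched as follows. This is a minimal illustration, not the keyword-table method itself: the keyword set `UNINFORMATIVE_WORDS` and the function names are hypothetical, and the 100-fold window is one of the two values (10 or 100 fold) mentioned above.

```python
import re

# Hypothetical keyword table marking annotations that say little about function.
UNINFORMATIVE_WORDS = {"est", "genomic", "cosmid", "bac"}


def is_informative(description: str) -> bool:
    """Return False for EST or genomic-DNA style annotations."""
    text = description.lower()
    if "expressed sequence tag" in text:
        return False
    words = set(re.findall(r"[a-z0-9]+", text))
    return not (words & UNINFORMATIVE_WORDS)


def best_annotation(hits, max_e=1e-50, fold=100.0):
    """Pick the most useful annotation from (e_value, description) pairs.

    Hits are scanned from strongest to weakest. The first informative
    annotation is returned, provided its E-value is below max_e and within
    `fold` of the top hit. Otherwise the top hit's annotation is returned.
    Note: if the top E-value is reported as 0.0, only hits that also have
    an E-value of 0.0 fall inside the window.
    """
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h[0])
    top_e = ranked[0][0]
    for e_value, description in ranked:
        if e_value >= max_e or e_value > top_e * fold:
            break
        if is_informative(description):
            return description
    return ranked[0][1]
```

With hits such as an EST at 1e-120, genomic DNA at 2e-120 and a protein kinase mRNA at 5e-119, the sketch returns the kinase annotation even though it is the weakest of the three.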

[0054] For each homolog, the present invention stores, in database tables “DNAsequence” 130 and “HomologyResults” 120, the name of the sequence (attribute “seqFile” 130-a and 120-a), the sequence (“Sequence” 130-b), the quality scores or Phred values (Ewing, Hiller, Wendl & Green, Genome Research, 8:175-185, 1998), (“QualityScores” 130-c), the accession number of any homolog, i.e. the GenBank identifier number (“GID” 120-e), the best GID derived from BLAST analysis (“BestBlastnGID” 130-f), the best GID derived from BLAST against the patent DNA database analysis (“BestPatent-GID” 130-g), the best GID derived from Smith-Waterman analysis derived from the Swiss-Prot database (“BestSW-GID” 130-h), the best GID derived from Smith-Waterman analysis of the patent database (“BestPatent-SW-GID” 130-i), the best GID derived from the best human homolog in BLAST analysis (“BestHumanBlastn-GID” 130-j), and the best GID derived from the best human homolog derived from Smith-Waterman analysis (“BestHuman-SW-GID” 130-k). For any homolog, the algorithm (e.g. BLAST or HMM) used for the homology search is recorded (“Algorithm” 120-b), the frame of the predicted protein for protein comparisons (“Frame” 120-c), the database searched (“Database” 120-d), the GenBank annotation for any homolog (“HitDescription” 120-f), the species of the annotation (“Species” 120-g), the E-value (“E-value” 120-h), the length of the alignment region (“AlignLength” 120-i), the percent identity of the aligned sequences (“PercentIdentity” 120-j), the length of the query in the alignment (“QueryLength” 120-k), the length of the target in the alignment (“TargetLength” 120-l), a number representing the fraction of the total query length represented in the hit region (“ALength/QLength” 120-m), the start position of the query sequence in the alignment (“QueryStart” 120-n), the position of the end of the query (“QueryEnd” 120-o), the start position of the target sequence (“TargetStart” 120-p), the end position of the target sequence (“TargetEnd” 120-q), the query sequence in the alignment (“QSequence” 120-r), the consensus of the alignment (“Consensus” 120-s), and the target sequence in the alignment (“TSequence” 120-t).

[0055] Referring back to FIG. 3, server 14 then obtains (step 56) descriptive annotative information on the biochemical function(s) and the physiological role(s) for the known genes from the plurality of cDNA sequences and stores the information in the database (step 58). FIG. 6 depicts a simplified flowchart 140 showing processing performed by an embodiment of the present invention for obtaining descriptive annotative information for the known genes. As shown in FIG. 6, several different techniques may be used by the present invention to obtain the functional information. According to a first technique, the present invention accesses information sources containing functional information related to the known genes (step 142). The information sources may include articles, published material, and other like material accessible to server 14. According to a specific embodiment, the present invention may use the accession numbers or the GenBank identifiers (GIDs) associated with the DNA sequences and their homologs to find the published material. Text processing tools may then be used by the present invention to automatically extract functional information from the information sources accessed in step 142 (step 146). The extracted information may then be summarized (step 148) and stored in the database (step 150).

[0056] According to another technique, the present invention may obtain the functional information from databases storing functional information and which are accessible to server 14 (step 144). Examples of such databases include databases provided by Proteome of Boston, Mass., DoubleTwist of Oakland, Calif., the GenBank database of deposited DNA and protein sequence data (http://www.ncbi.nlm.nih.gov:80/entrez/), the SWISS-PROT protein database (http://www.expasy.ch/sprot/), the PubMed or Medline (NCBI) (http://www.ncbi.nlm.nih.gov) databases of abstracts derived from thousands of peer-reviewed biomedical journals, and other like databases. The Proteome databases are concise descriptions of known genes, their protein products and their functions and roles and known interactors as described in the current literature. The information extracted from the published material and genomic databases may then be summarized (step 148) and stored in the database (step 150).

[0057] The GenBank record of a cDNA or gene sequence commonly contains references to peer-reviewed publication information about the gene, stored in the Medline database. The Medline database can be accessed via the Internet via the PubMed interface (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi). Alternatively, the GenBank record contains informative keywords related to the gene which may be used to perform broad topic searches on the Medline database. For example, protein products of genes participate in many processes essential to metabolism, development and reproduction. In some cases, a protein encoded by a gene may have more than one function and/or more than one role. For example, the yeast inositol 1-4-5 triphosphate kinase enzyme adds a phosphate moiety to phosphoinositol, an important component involved in signaling. However, this protein also can act as a regulatory scaffolding protein for transcription factors in the nucleus (Audrey R. et al. Science 287:2026-2029, 2000). Thus, this single protein can function as both an enzyme and a structural protein. Similarly, this gene product has two roles: it can participate in signaling processes and mRNA transcription. These instances are also examples of general pathways, but further annotative information from the published literature could refine these topics to even more specific pathways. For example, the enzymatic activity might be most important for a growth hormone pathway and the structural role might be more important to a specific subset of transcription factors engaged in controlling cell division. In this invention, these relational links between genes and cellular or organismal processes constitute a web of interacting pathways that are extracted accurately and comprehensively.

[0058] The biological demand for extracting information from published material, such as abstracts, in a comprehensive and consistent manner is unique to the world of manual biological annotation. Traditionally, extraction of information was done manually with varying degrees of consistency and accuracy. With recent advances in information extraction technologies, various software programs have been developed to automate information extraction and to summarize the extracted information. Examples of such programs include programs provided by Inxight Corp. of Santa Clara, Calif. Another example of a software package for information or knowledge extraction is the Crystal-Badger-Marmot suite from the Center for Intelligent Information Retrieval, Univ. of Massachusetts, Amherst, Mass. Such software programs have been applied to extract information from abstracts of published papers as well as from full-text papers. According to an embodiment of the present invention, these techniques are applied to generate tables of genes, tables of pathways composed of genes, and tables of relationships between and amongst genes and pathways. As described below, the validity of a relationship between or amongst genes is evaluated in a quantitative fashion.

[0059] According to an embodiment of the present invention, information extraction programs, such as those discussed above and others, may be used to extract (step 146 in FIG. 6) descriptive annotation information from information accessible to server 14 and to summarize (step 148 in FIG. 6) the information. According to an aspect of the present invention, the annotative information is stored in a database.

[0060] According to the present invention, information is extracted and stored for both the majority views and potentially multiple minority views. This is due to dramatic shifts in the understanding of biological systems over time. These shifts are also referred to as “paradigm shifts” (Kuhn, T., The Structure of Scientific Revolutions, Univ. Chicago Press 1962). According to these paradigm shifts, a minority view becomes accepted as being the correct interpretation after critical new data is acquired. The change in accepted “truth” of a paradigm can be dramatic or subtle in various domains of knowledge, and in the realm of biology both extremes can occur; hence the need for comprehensive collections of entity-relationships amongst genes, functions, roles and pathways. The need for storing both the majority and minority views becomes important when one realizes that the laws of biology are not yet deterministically known. This is substantially different from prior art bioinformatics techniques which only stored information related to the majority view (e.g. T. Rindflesch, L. Tanabe, J. Weinstein & L. Hunter PSB 2000:517-528).

[0061] For example, for a given biological topic, perhaps 51, 75, or 90 out of 100 published abstracts may describe a phenomenon as being caused by the interactions of genes A and B whereas a smaller subset of abstracts, perhaps 10, 25 or 49, may describe a more complex interaction between genes A and C prior to gene B. The former A-B model would be considered the consensus, “majority view” model (a “truth”) and the latter A-C-B model would be considered a “minority view” and likely regarded as being “false.” According to traditional bioinformatics techniques, only information related to strict “truths” was maintained and information related to the minority view(s) was discarded to reduce the amount of data being stored.

[0062] According to an embodiment of the present invention, minority views (e.g. unusual or unexpected relationships between genes or metabolic pathways) are also stored in the database but assigned a lower reference score (see “RefScore” attribute 200-k in table “Reference” 200 in FIG. 7, “FunctionScores” attribute 170-g in table “Function” 170, “RoleScores” attribute 180-l in table “Role” 180, and attributes 220-a through 220-f of table “RefScore” 220) associated with the descriptive annotation of the known genes from the plurality of cDNA. The reference score (or its summary scores, “FunctionScores” 170-g and “RoleScores” 180-l) quantifies the “acceptance/majority opinion” for an alleged role or function of a gene. Of particular importance to “minority” views is the extraction and recording of special circumstances or boundary conditions under which the phenomena or relationship amongst genes might exist. For example, information related to minority views (e.g. unusual or unexpected relationships between genes or metabolic pathways) is stored in the database but assigned a lower reference score than information associated with the majority view. The metric for evaluating a specific published reference article also assigns a score derived from the Citation Index database (Institute for Scientific Information, Philadelphia) which quantitatively ranks the impact of a given paper by the number of times that paper is subsequently referenced. For the most significant papers, a published article can be referenced thousands of times. The Citation Index also ranks journals with high impact but only from the same criteria of frequently-cited papers from the journals, regardless of whether the published paper is ultimately revised or shown to be inaccurate or limited to a set of conditions. Hence, one embodiment of this invention provides a mechanism to take into account the quality of the information source. This is both a general and a specific measure. In general, articles in journals respected by a consensus of biomedical and genomics practitioners are believed to be reliable. For example, a publication in journals with a recognized, rigorous peer-review process (e.g. Science, Nature, the Journal of Biological Chemistry, or the Journal of Clinical Investigation) would receive 100 points or >90 points whereas publication in “lesser” journals (e.g. Journal of Antisense Research or Experimental Cell Research) would only receive 10 or 40 points.

[0063] FIG. 9 is an exemplary look-up table for general rankings of such biomedical journals. However, scores from FIG. 9 may be adjusted because the information source's peer-review process can be dependent upon the reviewers for a given domain or the degree of democratic consensus of a journal's editorial board. A domain specific weighting factor is derived for the major journals and can be applied systematically while in other cases, a human annotator must make the judgment. The adjustment can range between 10 and 50% of the original score and an article in a “lower-quality” journal can be upgraded or an article in a “higher-quality” journal can be downgraded.
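The journal score adjustment can be illustrated with a short Python sketch. The look-up values below are placeholders in the ranges quoted in the text (FIG. 9 itself is not reproduced here), and the names `JOURNAL_SCORES` and `adjusted_journal_score` are hypothetical. The sketch does not cap the adjusted score, since no cap is stated.

```python
# Placeholder look-up table; the real values would come from FIG. 9.
JOURNAL_SCORES = {
    "Science": 100,
    "Experimental Cell Research": 40,
}


def adjusted_journal_score(journal: str, adjustment: float, upgrade: bool) -> float:
    """Apply a domain-specific adjustment of 10-50% to a journal's base score.

    adjustment is a fraction of the original score (0.10 to 0.50).
    upgrade=True raises the score, upgrade=False lowers it.
    """
    if not 0.10 <= adjustment <= 0.50:
        raise ValueError("adjustment must be between 10% and 50%")
    base = JOURNAL_SCORES[journal]
    delta = base * adjustment
    return base + delta if upgrade else base - delta
```

Under these placeholder values, an article in a 40-point journal upgraded by 50% scores 60, and an article in a 100-point journal downgraded by 10% scores 90.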

[0064] While subject to a degree of subjectivity, these standards for ranking journals and their domain preferences are the same as those used by faculty-tenure review committees in major medical schools in the United States of America in order to evaluate the publication record of a tenure-candidate. Similarly, human experts in various domains recognize that certain information sources can have a predisposition to disregard or highly regard certain authors or types of submitted work. Since the editorial board and peer-reviewers of journals change with time, the tables for grading journals are not static but must be revised over time as reviewers or editors specific to domain specialties change. In combination with the Citation Index of impact journals, these criteria enable the scoring of a reference's support of a gene's annotation.

[0065] Another variable used in the evaluation of the experimental support for an alleged role or function for a gene is a “follow-on” parameter. Reliable experimentalists often will publish a series of papers in reputable journals. They may publish on the same gene or encoded protein (“GeneRef” 230-a attribute of table “FollowOnWork” 230 in FIG. 7, or “ProteinRef” 230-b), a close homolog (“FamilyMemberRef” 230-c), another gene in the same pathway (“PathwayRef” 230-d) or the same gene or pathway in another organism (“altOrganismRef” 230-e). When a large body of work from an individual author or group of authors accumulates, then the probability of “truth” is high. In contrast, a single publication by an author that alleges unusual relationships amongst genes and that fails to engender follow-on work (as roughly measured by the Citation Index) by the original author or others has a lower probability of “truth” which is reflected by a lower reference score (“RefScore” 200-k). An intermediate reference score occurs where a single publication triggers much work by other investigators, e.g. a high Citation Index but low “follow-on” value. Thus, this strategy compensates for the overall weakness of the Citation Index: by merely enumerating the occurrences of a referenced paper, the Citation Index may not accurately represent the relatedness of subsequent work.
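The three qualitative outcomes described above (high, intermediate and low reference scores) can be sketched as a simple decision rule. The description gives no numeric cut-offs, so the thresholds `many_citations` and `many_follow_ons` and the function name `support_level` are purely hypothetical.

```python
def support_level(citation_count: int,
                  follow_on_count: int,
                  many_citations: int = 100,
                  many_follow_ons: int = 3) -> str:
    """Classify the support for a reference from two counts.

    citation_count:  times the paper was cited (Citation Index value).
    follow_on_count: later papers on the same gene, protein, family
                     member, pathway or alternate organism (table 230).
    Both thresholds are illustrative assumptions, not disclosed values.
    """
    # A large accumulated body of follow-on work implies high support.
    if follow_on_count >= many_follow_ons:
        return "high"
    # Widely cited but with little follow-on work: intermediate support.
    if citation_count >= many_citations:
        return "intermediate"
    # A single publication that engendered little further work.
    return "low"
```

A numeric “RefScore” could then be derived from this level together with the journal score, but the way those terms are combined is not specified here.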

[0066] FIG. 7 depicts the functional annotative information stored for the genes according to an embodiment of the present invention. Database tables 160, 170, 180, 190, 200, 210, 220, and 230 depicted in FIG. 7 include annotation information derived from peer-reviewed articles and other information accessed by server 14. A table of the annotation summary (“AnnotationSummary” 160) includes the sequence name (“SeqFile” 160-a), best hits (“BestHits” 160-b) which refers to the “DNAsequence” table 130 (“BestBlastnGID” 130-f), a link to the “Function” table 170 (“Function” 160-c), a link to the “Role” table 180 (“Role” 160-d), and a link to the “Evidence” table 190 (“Evidence” 160-e). The Function 170, Role 180 and Evidence 190 tables contain many attributes which all refer to individual References (“Reference” table 200). Any reference in “Reference” table 200 (“RefID” 200-a) that supports the concept that a gene is an enzyme (“EnzymeRef” 170-a), a receptor (“ReceptorRef” 170-b), a channel or transporter (“ChannelRef” 170-c), a protein interactor (“InteractorRef” 170-d), a structural protein (“StructuralRef” 170-e), a nucleic acid binding protein (“NucleicAcidBindingProtein” 170-f), has a role in cognition (“CognitionRef” 180-a), or a role in development (“DevelopmentRef” 180-b), or a role in endocytosis (“EndocytosisRef” 180-c), a role in exocytosis (“ExocytosisRef” 180-d), or a role in metabolism (“MetabolismRef” 180-e), or a role in regulation (“RegulationRef” 180-f), or a role in reproduction (“ReproductionRef” 180-g), or a role in signaling (“SignallingRef” 180-h), or a role in RNA splicing (“SplicingRef” 180-i), or a role in vesicle trafficking (“TraffickingRef” 180-j), or a role in transcription (“TranscriptionRef” 180-k) is duly linked to the appropriate reference identifier (“RefID” 200-a). The weighted scores for each of these possible functions are stored as a multi-item list (“FunctionScores” 170-g). Similarly, the weighted scores for each of the possible roles are stored as a multi-item list; e.g. a “RoleScores” (180-l) equivalent to “0,100,100,0,0,0,0,0,0,0,0” might correspond to a single published article on a gene's role in the endocytosis of key nutrients during development in a prominent journal such as Science (“DevelopmentRef” 180-b and “EndocytosisRef” 180-c). In a database query, such a summary weighted score can be simply compared to other scores by both the maximum value of each comma-delimited item as well as the rank order amongst comma-delimited items. Similarly, any experimental evidence contained in the reference that shows that a gene's encoded protein was immunoprecipitated (“ImmunePrecipRef” 190-b), a gene's encoded mRNA was hybridized in a Northern assay (“NorthernRef” 190-c), a gene was hybridized in a Southern blot (“SouthernRef” 190-d), a protein band of appropriate predicted size was identified in a Western blot (“WesternRef” 190-e), an open reading frame was identified in a yeast two-hybrid interactor analysis (“InteractorAnalysisRef” 190-f), an enzymatic assay (“BiochemistryRef” 190-g), a pharmacological profile was determined (“PharmacologyRef” 190-h), a predicted homologous domain (“HomologyRef” 190-j) or a predicted structural 3-dimensional motif (“StructureRef” 190-k) is duly referenced to the appropriate reference identifier (“RefID” 200-a).
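The comparison of comma-delimited summary scores by maximum value and by rank order can be sketched as below. The `ROLES` list follows the order of the role attributes 180-a through 180-k given above; the function names are hypothetical.

```python
# Role names in the order of attributes 180-a through 180-k.
ROLES = [
    "cognition", "development", "endocytosis", "exocytosis", "metabolism",
    "regulation", "reproduction", "signaling", "splicing", "trafficking",
    "transcription",
]


def parse_scores(role_scores: str) -> list:
    """Split a comma-delimited score string into integers."""
    return [int(item) for item in role_scores.split(",")]


def max_score(role_scores: str) -> int:
    """Highest weighted score among all roles."""
    return max(parse_scores(role_scores))


def ranked_roles(role_scores: str) -> list:
    """Roles with a non-zero score, highest score first.

    Ties keep the attribute order because Python's sort is stable.
    """
    pairs = [(role, score)
             for role, score in zip(ROLES, parse_scores(role_scores))
             if score > 0]
    return [role for role, _ in sorted(pairs, key=lambda p: -p[1])]
```

For the example string “0,100,100,0,0,0,0,0,0,0,0”, `max_score` gives 100 and `ranked_roles` gives development followed by endocytosis.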

[0067] Referring further to FIG. 7, tables are shown to record the information about any pathway or reference. For any pathway (“Pathway” 210-a in table “Pathway” 210), a role may be assigned (“Role” 210-b), genes of the pathway listed (“GeneList” 210-c) and the location of the pathway identified (“Locations” 210-d). For any reference, a unique identifier (“RefID” 200-a) is recorded, the authors listed (“Author” 200-b), the article title (“Title” 200-c), the journal in which the article was published (“Journal” 200-d), the volume of the journal (“Volume” 200-e), the page numbers of the article (“Page” 200-f), the year of the article's publication (“Year” 200-g), and the reference score link (“RefScore” 200-k). The reference score link 200-k corresponds to the “RefScore” object/table 220 which also contains the reference identifier (“RefID” 220-a), the citation index value (if any) (“CitationIndex” 220-b), the topic field (e.g. immunology or neurobiology) (“Domain” 220-c), a domain weight-adjusted value for the journal quality, as described above, (“JournalRigor” 220-d), and the link to follow-on work table 230 (“FollowOnWork” 220-e). The follow-on table 230 consists of a reference to any subsequent work in which the same gene (“GeneRef” 230-a) or protein (“ProteinRef” 230-b), or homologous gene (“FamilyMemberRef” 230-c), or the same pathway (“PathwayRef” 230-d) or alternate organism (“altOrganismRef” 230-e) was studied by the original investigators.

[0068] Referring back to FIG. 3, the present invention then obtains (step 59) and stores (step 60) expression profile data for the genes and their homologs. The expression profile data for a gene describes how the gene is expressed, or transcribed to RNA. Profiles can be created for genes in cells or tissues under influences of a drug, as a cell or tissue develops, or during changes to the physiological state of the cell or tissue, or in response to the development of disease in humans or an animal model. For example, the expression profile data may indicate whether a gene is up-regulated/down-regulated during a stroke.

[0069] FIG. 8 depicts the gene expression profile data stored in the database according to an embodiment of the present invention. The four tables depicted in FIG. 8 correspond to a summary of the array result conditions (“ArrayResults” 240), the summarized array data (“ArrayData” 250), the details of the probe(s) (“Probe” 260), and the raw data (“RawData” 270). The array result conditions table 240 contains attributes that describe a unique experimental identifier (“ExptID” 240-a), the corresponding bar code (“BarCode” 240-b), the link for probe 1 (“Probe1” 240-c), the link for probe 2 (“Probe2” 240-d), a term that describes the grid pattern (“GridPattern” 240-e), the clone set identifier (“CloneSet” 240-f), the link to array data (“ArrayData” 240-g), and a comment (“Comment” 240-h). The array data table 250 contains attributes to describe the experimental identifier (“ExptID” 250-a), the name of the cDNA sequence (“seqFile” 250-b), the arithmetic mean of the background or normalized data (“Mean” 250-c), the standard deviation (“StdDev” 250-d), the ratio of any paired means derived from simultaneous application of two probes (“Ratio” 250-e), the time point at which the probes were made (“TimePt” 250-g), the biological state (e.g. diseased or normal) of the probe's mRNA origin (“State” 250-h), the clustering method (“ClusterMethod” 250-i), the cluster number (“Cluster” 250-j), the total number of clusters (“TotalClusters” 250-k), the cluster order pattern derived from the auto-regression analysis used in the causality analysis (“ClusterOrder” 250-l) and the date of the clustering (“ClusterDate” 250-m).

[0070] The probe data table 260 contains attributes for the probe identifier (“ProbeID” 260-a), the date of probe generation (“Date” 260-b), the type (first strand cDNA or double-stranded cDNA) of probe (“Type” 260-c), the biological model (“Model” 260-d), the identifier for the preparation of RNA (“RNAprep” 260-e), the labeling (radioactive or fluorescent) method (“LabelType” 260-f), the time point at which the RNA was collected (“TimePt” 260-g), the biological state of the probe's mRNA origin (“State” 260-h), and a comment (“Comment” 260-i).

[0071] The raw data table 270 contains attributes for the experimental identifier (“ExptID” 270-a), the sequence name (“seqFile” 270-b), the probe name (“Probe” 270-c), the raw intensity value (“RawValue” 270-d), the local background or normalization factor (“LocalBgnd/factor” 270-e), and the arithmetically corrected intensity value (“CorrectedValue” 270-f).

[0072] Referring back to FIG. 3, the present invention then performs clustering analysis on the behavior of DNA sequences in expression profile studies (step 62). According to clustering analysis, data complexity is reduced by partitioning the genes into groups or “clusters” that have similar attributes. These attributes can be the behavior of genes monitored over multiple time points in response to an injury, onset of disease or altered physiological state (e.g. intensity or ratio of intensities resulting from hybridization of a gene set with probes derived from normal and diseased tissue). Also, these attributes can simply be the response of genes from cells, tissues or animals treated with multiple concentrations (e.g. 5, 6 or 7 concentrations) of many drugs (e.g. 10, 100, 1000 or 10,000) with differing mechanisms of action at a single time point. These attributes can also be the response of cells or animals subjected to many altered physiological states (e.g. elevated or diminished nutrients, ions or temperature, transient ischemia, shock, anxiety, discomfort or depression) monitored at a single time point relative to untreated cells or tissues. The result of clustering gene expression data is a set of clusters of genes with similar expression profiles.

[0073] An embodiment of the present invention implements a method of gene clustering that is tuned to the simplified, yet specific nature of the array data itself. In order to reduce data complexity, many clustering methods have been applied to gene expression profile data: these include hierarchical, K-means, self-organizing maps (Tamayo et al. PNAS 96:2907-12), or support vector machines (M. Brown et al. PNAS 97:262-7). An embodiment of the present invention uses K-means clustering with Euclidean distance or other distance metrics (provided by Partek of St. Louis, Mo.) because of its ability to efficiently cluster data in an automated unsupervised manner. One of the common criticisms of K-means clustering is that the number of clusters must be determined a priori. However, the present invention uses the Davies-Bouldin algorithm (IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-1, April 1979) which determines the optimal number of clusters based upon the dispersion and flatness of clusters.
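The combination of K-means clustering with a Davies-Bouldin criterion for choosing the number of clusters can be sketched in plain Python. This is an illustrative sketch, not the Partek implementation: it uses the commonly published form of the Davies-Bouldin index (mean within-cluster scatter relative to centroid separation, lower is better), and a deterministic farthest-point initialization that is an assumption made so the example is reproducible. All function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def initial_centers(points, k):
    """Farthest-point initialization (an assumption, for determinism)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=100):
    """Return (labels, centers) after Lloyd-style K-means iterations."""
    centers = initial_centers(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(centers[j]))  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index; returns infinity for degenerate clusterings."""
    k = len(centers)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            return math.inf
        scatter.append(sum(dist(p, centers[j]) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if i == j:
                continue
            separation = dist(centers[i], centers[j])
            if separation == 0:
                return math.inf
            worst = max(worst, (scatter[i] + scatter[j]) / separation)
        total += worst
    return total / k


def best_k(points, candidate_ks):
    """Number of clusters with the lowest Davies-Bouldin index."""
    scored = []
    for k in candidate_ks:
        labels, centers = kmeans(points, k)
        scored.append((davies_bouldin(points, labels, centers), k))
    return min(scored)[1]
```

For a toy data set made of three well-separated groups of profiles, `best_k(points, range(2, 6))` selects 3, which is the behavior the a priori criticism of K-means is meant to be answered by.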

[0074] According to an embodiment of the present invention, the present invention may cluster the genes based on time-course data as described by the expression profile data. According to a specific embodiment of the present invention, packages provided by Partek Inc. and/or SAS Institute, Incorporated of Cary, N.C. may be used to perform the clustering analysis. For time-course data, the clustering analysis may also include causality analysis to predict ordered relationships between clusters on a time basis. Causality analysis is performed using a regressive method performed with software packages such as the Statistical Analysis Software from SAS Institute, Incorporated. The results from the clustering analysis are stored in a database (step 64). The cluster analysis results are inserted into the array data table 250 of FIG. 8: for each gene (“seqFile” 250-b), the clustering method (“ClusterMethod” 250-i), a cluster number (“Cluster” 250-j), the total number of clusters (“TotalClusters” 250-k), and the cluster order (“ClusterOrder” 250-l).

[0075] The type of clustering method(s) used to analyze array data depends upon (a) a priori knowledge about the behavior of the immobilized genes, (b) the composition of the gene set itself, and (c) the choice of array technologies. Array technologies come in two general forms: cDNA and oligonucleotide arrays. Since the Affymetrix arrays often have a higher density than cDNA arrays, the emphasis has been to increase the number of sequences per unit surface area in order to gain thoroughness. Often, inadequate attention is paid to the design of the actual DNA attached to the array. Thus, many array chip designs seek to deposit large numbers of gene fragments per chip, such as species-specific chips (mouse, rat or human chips from Affymetrix, Santa Clara, Calif.) or genes representative of a field (apoptosis, cancer or neurobiology chips from Affymetrix or Clontech, Palo Alto, Calif.). However, analysis of such chips is complicated by the fact that most genes on the chip may be irrelevant to the biological system being studied.

[0076] According to the present invention, the analysis of gene clusters is vastly simplified by the immobilization of a plurality of genes that are actually disease- or physiologically-specific. Such collections of genes can be generated by any method that enables the identification of genes expressed at a measurably higher level in one state than another. For example, in tumors or animals subjected to ischemia, those skilled in the art of molecular cloning can identify and isolate cDNA clones and derive the sequences thereof for genes whose expression is elevated 2, 3 or 10 fold in the altered physiological state; e.g. differential display and subtractive cloning are two such methods. The number of disease-related or physiologically-related genes may be, for example, 1000, 6000, 10,000, or 20,000 per chip.

[0077] When analyzed by principal components analysis, typically 90% of the variability in the gene expression profile data generated by arrays of 6000-10,000 disease- or paradigm-specific cDNA targets can be explained by the first 3 principal components or eigenvectors. With a large number of genes unrelated to the biological paradigm of the probe (e.g. 40,000-60,000 genes present on some Affymetrix arrays), the data variability is likely explained by many more principal components, which makes analysis difficult because no more than 3 principal components can be examined at once in 3-dimensional space. For these instances, other clustering methods might be more appropriate, such as hierarchical clustering. However, optimal hierarchical clustering is highly iterative and false clusters are often generated.

[0078] In order to infer the time-order of gene clusters derived from the above, it is possible to calculate likely causality by a moving auto-regressive analysis. A time-order is a linear ranking of clusters by a deduced set of relationships ordering the first possible cluster relative to other clusters in an iterative process. A biological example of this problem is the goal of understanding which genes respond earliest to an injury or infection followed by the elucidation of time of activation of subsequent, related or unrelated genes. An ordered set of clusters from expression profile data is achieved initially by selecting a representative subset of genes near the centroid of each cluster (e.g. 2, 5 or 10 representing 1-10% of the total number of genes) and performing a moving auto-regressive test against the remaining genes of the monitored population of genes (e.g. 2, 5 or 10 genes compared to all 6000 or 10,000 genes) from all clusters (Statistical Analysis Software of SAS Institute, Incorporated, Cary, N.C.). The ranked order of clusters is stored in “ClusterOrder” (250-l) in step 64.
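The first step of this procedure, selecting a representative subset of genes near the centroid of a cluster, can be sketched as follows. The function name and the dictionary layout are hypothetical, and the moving auto-regressive test itself (performed with the SAS package in the description) is not reproduced.

```python
import math


def select_representatives(cluster: dict, n: int) -> list:
    """Return the names of the n genes closest to the cluster centroid.

    cluster maps a gene name to its expression profile, i.e. a list of
    values measured at the same time points for every gene.
    """
    profiles = list(cluster.values())
    # Centroid: mean expression value at each time point.
    centroid = [sum(column) / len(profiles) for column in zip(*profiles)]

    def distance_to_centroid(name):
        return math.dist(cluster[name], centroid)

    return sorted(cluster, key=distance_to_centroid)[:n]
```

The selected genes would then be compared against the remaining monitored genes in the auto-regressive test, which keeps the number of pairwise comparisons far below an all-against-all calculation.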

[0079] The accuracy of ordering clusters is dependent on the completeness of the calculation, but calculation of cluster order is computationally intensive. For example, according to a specific embodiment, the above calculation requires about 24 hours on a standard single CPU Unix workstation with 1 gigabyte of RAM; e.g. a Sun Ultra10 workstation with 300 MHz CPU. This time-series analysis is only applicable to datasets with regularly spaced time-points (e.g. 10, 20 or 40 instances spaced 30 min, 1 hr or 3 hrs apart). The time-resolution of the causality analysis is dependent upon the density of intervals over the entire experimental course. For the highest resolution of time-ordered relationships amongst clusters, 20, 50, or 100 time-points are preferable. For the highest accuracy amongst clusters, a comprehensive auto-regression is calculated, provided sufficient computer power is available (e.g. 6000 genes compared to 6000 genes or 10,000 genes compared to 10,000 genes requires supercomputer ability or the efforts of a cluster of workstations such as Beowulf: (http://www.beowulf.org/)).

[0080] Referring back to FIG. 3, after the clustering analysis, the present invention may obtain pathway information (step 65) for the genes and their homologs and store the pathway information in the database (step 66). Pathway information can be accessed from public pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) or the Munich Information Center for Protein Sequences (MIPS), or derived from the literature using information extraction methods, as described earlier.

[0081] According to an embodiment of the present invention, the database used for storing information associated with the genes correlates the annotative information with numerical gene expression profile data. Within each time-resolved cluster of genes with similar behavior, multiple types of genes may exist (“Cluster” 250-j is linked to “seqFile” 250-b which can be referenced to the annotation summary “seqFile” 160-a). For example, genes that are stimulated immediately after an injury or stress might include chaperones or heat shock proteins in order to prevent misfolded proteins. Similarly, transcription factors might be triggered to increase the production of protective systems. All of these genes' mRNA levels could be elevated within the first 5 minutes post-injury but their mRNA levels might diminish at varying rates. Subsequently, secondary and tertiary groups of genes might be activated in response to the transcription factors. While the clustering and causality analysis described above can identify groups of early onset genes, it cannot distinguish the functional relationship, if any, between differing kinds of genes within each time-ordered group. For this task, integration of the annotation of all genes for each time-ordered group is necessary. Currently, such analyses are performed by human experts and are limited by recall, whereas a database query constrained by user-defined parameters could present all possible cross-connections that are likely or less likely, depending upon the reliability threshold (“FunctionScores” 170-g, “RoleScores” 180-1) for “truth” of a relationship defined by the user. Thus, multiple alternate scenarios can be presented in a database, in tabular form, or as graphical objects linked by lines that purport directional control, with annotative text describing the likelihood of the interaction along with hyperlinks to relevant published articles via HTML (hypertext markup language) methods.

[0082] A feature of the present invention is that it provides support for both intra- and inter- time-resolved gene cluster components; i.e. between or amongst genes in subsequent or previous groups of genes. Thus, a human expert can choose from a palette of options to refine a first iteration of gene network or pathway building. The parameters in turn can be used to recalculate the likelihood of other annotations and pathways to explain the behavior of a single gene, group of genes, or cluster of genes. Collectively, these methods can reduce the number of differentially regulated genes to a smaller plurality, from which candidate genes can be chosen by the human expert.

[0083] The information stored in the database according to the present invention facilitates the identification of candidate genes (step 68 in FIG. 3). Identification of candidate genes results from the merge of the time-ordered gene expression clusters and the function(s), role(s) and/or pathway(s) information of the cluster members. The reference score-based assignments for either majority or minority view annotations of function(s), role(s) and/or pathway(s) enable the identification of new or serendipitous relationships. Such biological novelty, i.e. the unexpected up- or down-regulation of a gene in the context of an existing or new pathway, is one of the hallmarks of candidate genes. For example, in a signaling pathway, study of a disease model may reveal that one, two or three known phosphodiesterases are up-regulated in the context of a pathway not normally characterized by those enzymes. Or, a new family member of this enzyme class might be discovered up-regulated along with the expected enzyme. Both are examples of candidate genes revealed by the combination of annotated DNA sequences and expression profiling data, particularly if the published literature contained an obscure reference to such a relationship under abnormal circumstances dissimilar to the conditions of the experimental paradigm. The latter result would be significant due to the redundancy of biological systems. Conversely, if 7, 8 or 9 of 10 genes of a well known pathway are found to be up-regulated in a disease or injury model (as determined by a comparison of all pathways of each gene expression profile cluster), then the 1, 2 or 3 genes that failed to be induced (as determined by a query comparison to the pathway database) might also be considered candidate genes. In this example, the user might conclude that a new inhibitor is blocking the 1, 2, or 3 missing genes and hence blocking the inhibitor might diminish the pathology or improve recovery. The user might then search for known or postulated inhibitors of any member of the pathway.
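As a non-limiting illustration of the "missing member" comparison described above, the following minimal sketch flags pathways in which most members are up-regulated and reports the members that failed to be induced. The function name `missing_pathway_members`, the 0.7 cut-off, and the gene and pathway identifiers are hypothetical and are used for illustration only.

```python
def missing_pathway_members(pathways, up_regulated, min_fraction=0.7):
    """Return, per pathway, the members that were not up-regulated.

    Only pathways in which at least `min_fraction` of the members are
    up-regulated, but not all of them, are reported.
    """
    candidates = {}
    for name, members in pathways.items():
        induced = members & up_regulated
        missing = members - up_regulated
        if missing and len(induced) / len(members) >= min_fraction:
            candidates[name] = sorted(missing)
    return candidates

# Hypothetical pathway database and expression result.
pathways = {
    "pathway_A": {f"gene_{i:02d}" for i in range(1, 11)},   # 10 members
    "pathway_B": {"gene_11", "gene_12", "gene_13", "gene_14"},
}
up_regulated = {f"gene_{i:02d}" for i in range(1, 9)} | {"gene_11"}

result = missing_pathway_members(pathways, up_regulated)
print(result)
```

In this example, 8 of the 10 members of `pathway_A` are up-regulated, so its 2 uninduced members are reported as possible candidate genes, while `pathway_B`, with only 1 of 4 members induced, is not reported.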

[0084] The information stored in the database may be accessed or queried by users interested in identifying candidate genes. According to a specific embodiment, the present invention provides an interface allowing users to specify a query including criteria characterizing candidate genes. In response to the user query, the present invention searches the database to identify genes which satisfy the user-specified search criteria. A typical search might examine the group of classified genes (e.g. by function, role or pathway) appearing in an early or middle expression cluster (based on “Cluster” 250-j and “ClusterOrder” 250-l). By comparing the similar attributes (e.g. a query of the type “what apoptotic regulator genes are present in early clusters along with chemokine genes?”) within upstream or downstream clusters, the user may be able to deduce, for example, that the apoptotic pathway in a particular infection model of immune cells was altered by either (a) the appearance of a new apoptotic regulator gene or chemokine at an unexpected time or cluster, or (b) the absence of altered expression of a gene known to be induced in the pathway. Alternatively, the user might query what low-likelihood roles or pathways might explain the presence of a given class of receptors. In response to the user query, the present invention uses the user-specified query criteria to search the information stored in the database and outputs genes which satisfy the user-specified search criteria by either their presence or omission from either known or low-likelihood roles (or pathways) or lists of genes with known function(s) or role(s). In this manner, the information stored for the plurality of DNA sequences and their behavior in expression profile data facilitates identification of candidate genes.
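As a non-limiting illustration of a query of the type described above, the following minimal sketch uses an in-memory SQLite database to find apoptotic regulator genes that share an early cluster with a chemokine gene. The table and column names (`cluster_member`, `cluster_order`, `role_score`), the function `genes_with_companion_role`, the score threshold, and all gene identifiers are hypothetical; they are loosely modeled on “Cluster”, “ClusterOrder” and “RoleScores” and do not reproduce the schema of the described embodiment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cluster_member (gene TEXT, cluster TEXT);
CREATE TABLE cluster_order  (cluster TEXT, rank INTEGER);
CREATE TABLE role_score     (gene TEXT, role TEXT, score REAL);
""")
conn.executemany("INSERT INTO cluster_order VALUES (?, ?)",
                 [("c1", 1), ("c2", 2), ("c3", 5)])
conn.executemany("INSERT INTO cluster_member VALUES (?, ?)",
                 [("geneA", "c1"), ("geneB", "c1"), ("geneD", "c1"),
                  ("geneC", "c2"), ("geneE", "c3"), ("geneF", "c3")])
conn.executemany("INSERT INTO role_score VALUES (?, ?, ?)",
                 [("geneA", "apoptotic regulator", 0.90),
                  ("geneB", "chemokine", 0.80),
                  ("geneC", "apoptotic regulator", 0.95),
                  ("geneD", "apoptotic regulator", 0.20),
                  ("geneE", "chemokine", 0.90),
                  ("geneF", "apoptotic regulator", 0.90)])

def genes_with_companion_role(conn, role, companion, max_rank, min_score):
    """Genes with `role` in clusters ranked <= max_rank that also hold a `companion` gene."""
    rows = conn.execute("""
        SELECT DISTINCT m.gene
        FROM cluster_member m
        JOIN cluster_order o ON o.cluster = m.cluster
        JOIN role_score r ON r.gene = m.gene
        WHERE r.role = :role AND r.score >= :min_score AND o.rank <= :max_rank
          AND EXISTS (SELECT 1
                      FROM cluster_member m2
                      JOIN role_score r2 ON r2.gene = m2.gene
                      WHERE m2.cluster = m.cluster
                        AND r2.role = :companion
                        AND r2.score >= :min_score)
        ORDER BY m.gene
    """, {"role": role, "companion": companion,
          "max_rank": max_rank, "min_score": min_score})
    return [gene for (gene,) in rows]

early = genes_with_companion_role(conn, "apoptotic regulator", "chemokine",
                                  max_rank=2, min_score=0.5)
print(early)
```

Lowering `min_score` in such a query admits lower-likelihood annotations, which corresponds to the user-defined reliability threshold, and raising `max_rank` extends the search to later clusters.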

[0085] Although specific embodiments of the invention have been described, various modifications, alterations, alternative constructions, and equivalents are also encompassed within the scope of this application. The described invention is not restricted to operation within certain specific data processing environments, but is free to operate within a plurality of data processing environments. For example, although the present invention has been described in a distributed computer network environment, the present invention may also be incorporated in a single stand-alone computer system. In such an environment, the same stand-alone computer has access to the various biological databases according to the present invention and may act both as a client and a server. Additionally, although the present invention has been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that the scope of the present invention is not limited to the described series of transactions and steps.

[0086] Further, while the present invention has been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are also within the scope of the present invention. The present invention may be implemented only in hardware or only in software or using combinations thereof.

[0087] The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.

What is claimed is:
 1. A computer-implemented method of identifying candidate genes from a plurality of DNA sequences, the method comprising: obtaining results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; obtaining annotative information for the plurality of DNA sequences, the annotative information comprising information about the biochemical functions and physiological roles of the plurality of DNA sequences; obtaining gene expression profile data for the plurality of DNA sequences, the gene expression profile data describing behavioral patterns of the plurality of DNA sequences; clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; storing the results of the homology search, the annotative information, the gene expression profile data, and results from clustering the plurality of DNA sequences in a database; receiving a query identifying criteria for the candidate genes; and searching the database, in response to the query, to identify a set of DNA sequences from the plurality of DNA sequences which satisfy the query criteria.
 2. The method of claim 1 wherein the homology search for the plurality of DNA sequences comprises performing BLAST analysis, Smith-Waterman analysis, Hidden Markov Model (HMM) analysis, and EMotif analysis.
 3. The method of claim 2 wherein performing the BLAST analysis, the Smith-Waterman analysis, the Hidden Markov Model (HMM) analysis, and the EMotif analysis comprises: performing the BLAST analysis on the first plurality of DNA sequences using a first database of sequences; identifying a second plurality of DNA sequences from the first plurality of sequences which are not known based on the BLAST analysis using the first database of sequences; performing Smith-Waterman analysis on the second plurality of DNA sequences using a protein database and a translated patent database; identifying a third plurality of DNA sequences from the second plurality of sequences which are not known based on the Smith-Waterman analysis; performing Hidden Markov Model (HMM) analysis and EMotif analysis on the third plurality of DNA sequences using the protein database and GenBank database; and performing BLAST analysis on the third plurality of DNA sequences using GenBank EST database.
 4. The method of claim 1 wherein obtaining the annotative information comprises: identifying known genes from the first plurality of DNA sequences based on the homology search; and accessing information sources storing annotative information for the known genes; and extracting the annotative information from the information sources for the known genes.
 5. The method of claim 4 wherein extracting the annotative information comprises: assigning a reference score to the extracted annotative information based on the level of acceptance of the role or function of the known genes as described by the annotative information such that annotative information with a high level of acceptance is assigned a higher reference score than annotative information with a low level of acceptance.
 6. The method of claim 4 wherein the information sources include GenBank database, SWISS-PROT database, Medline database, and biomedical publications.
 7. The method of claim 4 wherein: accessing the information sources comprises accessing biomedical publications; extracting the annotative information comprises: for annotative information extracted from each biomedical publication: assigning a reference score to the extracted annotative information based on characteristics of the biomedical publication, the reference score indicating the level of acceptance of the role or function of the known genes as described by the annotative information extracted from the biomedical publication; and storing the annotative information in the database comprises storing the reference score.
 8. The method of claim 7 wherein assigning the reference score comprises: using a score derived from a citation index database to calculate the reference score, the score derived from the citation index database indicating the number of times that the annotative information from the biomedical publication was referenced by other information sources.
 9. The method of claim 7 wherein assigning the reference score further comprises: ranking the biomedical publications; and assigning the reference score to the annotative information extracted from the biomedical publication based on the ranking of the biomedical publication.
 10. The method of claim 1 wherein clustering the plurality of DNA sequences comprises determining relationships between clusters of DNA sequences from the plurality of DNA sequences.
 11. The method of claim 1 wherein clustering the plurality of DNA sequences comprises clustering the plurality of DNA sequences based on time-course data described by the gene expression profile data.
 12. The method of claim 1 wherein storing the information in the database comprises correlating the annotative information for the plurality of DNA sequences with the gene expression profile data for the plurality of DNA sequences.
 13. A method of identifying candidate genes comprising: configuring a query identifying criteria for the candidate genes; communicating the query to a server storing information related to a plurality of DNA sequences, the information comprising: results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; information about the biochemical functions and physiological roles of the plurality of DNA sequences; information describing behavioral patterns of the plurality of DNA sequences; and results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; and receiving from the server, in response to the query, a first set of DNA sequences from the plurality of DNA sequences, wherein the first set of DNA sequences satisfy the criteria for the candidate genes identified in the query.
 14. A data processing system for identifying candidate genes from a plurality of DNA sequences, the system comprising: a processor; and a memory coupled to the processor, the memory configured to store instructions for execution by the processor, the instructions comprising: instructions for obtaining results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; instructions for obtaining annotative information for the plurality of DNA sequences, the annotative information comprising information about the biochemical functions and physiological roles of the plurality of DNA sequences; instructions for obtaining gene expression profile data for the plurality of DNA sequences, the gene expression profile data describing behavioral patterns of the plurality of DNA sequences; instructions for clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; instructions for storing the results of the homology search, the annotative information, the gene expression profile data, and results from clustering the plurality of DNA sequences in the memory; and instructions for searching the information stored in the memory, in response to a query identifying criteria for the candidate genes, to identify a set of DNA sequences from the plurality of DNA sequences which satisfy the query criteria.
 15. The system of claim 14 wherein the memory is further configured to store instructions for performing the homology search, the instructions comprising: instructions for performing BLAST analysis on the first plurality of DNA sequences using a first database of sequences; instructions for identifying a second plurality of DNA sequences from the first plurality of sequences which are not known based on the BLAST analysis using the first database of sequences; instructions for performing Smith-Waterman analysis on the second plurality of DNA sequences using a protein database and a translated patent database; instructions for identifying a third plurality of DNA sequences from the second plurality of sequences which are not known based on the Smith-Waterman analysis; instructions for performing Hidden Markov Model (HMM) analysis and EMotif analysis on the third plurality of DNA sequences using the protein database and GenBank database; and instructions for performing BLAST analysis on the third plurality of DNA sequences using GenBank EST database.
 16. The system of claim 14 wherein the instructions for obtaining the annotative information comprise: instructions for identifying known genes from the first plurality of DNA sequences based on the homology search; and instructions for accessing information sources storing annotative information for the known genes; and instructions for extracting the annotative information from the information sources for the known genes.
 17. The system of claim 16 wherein the instructions for extracting the annotative information comprise: instructions for assigning a reference score to the extracted annotative information based on the level of acceptance of the role or function of the known genes as described by the annotative information such that annotative information with a high level of acceptance is assigned a higher reference score than annotative information with a low level of acceptance.
 18. The system of claim 16 wherein the information sources include GenBank database, SWISS-PROT database, Medline database, and biomedical publications.
 19. The system of claim 16 wherein: the instructions for accessing the information sources comprise instructions for accessing biomedical publications; the instructions for extracting the annotative information comprise: instructions for assigning a reference score to annotative information extracted from each biomedical publication based on characteristics of the biomedical publication, the reference score indicating the level of acceptance of the role or function of the known genes as described by the annotative information extracted from the biomedical publication; and the instructions for storing the annotative information in the memory comprise instructions for storing the reference score.
 20. The system of claim 19 wherein the instructions for assigning the reference score comprise: instructions for using a score derived from a citation index database to calculate the reference score, the score derived from the citation index database indicating the number of times that the annotative information from the biomedical publication was referenced by other information sources.
 21. The system of claim 19 wherein the instructions for assigning the reference score comprise: instructions for ranking the biomedical publications; and instructions for assigning the reference score to the annotative information extracted from the biomedical publication based on the ranking of the biomedical publication.
 22. The system of claim 14 wherein the instructions for clustering the plurality of DNA sequences comprise instructions for determining relationships between clusters of DNA sequences from the plurality of DNA sequences.
 23. The system of claim 14 wherein the instructions for clustering the plurality of DNA sequences comprise instructions for clustering the plurality of DNA sequences based on time-course data described by the gene expression profile data.
 24. The system of claim 14 wherein the instructions for storing the information in the database comprise instructions for correlating the annotative information for the plurality of DNA sequences with the gene expression profile data for the plurality of DNA sequences.
 25. A system for identifying candidate genes comprising: a communication network; a first computer coupled to the communication network; and a second computer coupled to the communication network, the second computer configured to store: results of a homology search for a plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; information about the biochemical functions and physiological roles of the plurality of DNA sequences; information describing behavioral patterns of the plurality of DNA sequences; and results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; wherein the first computer is configured to communicate a query to the second computer, the query identifying criteria for the candidate genes; and wherein the first computer is configured to receive from the second computer, in response to the query, a first set of DNA sequences from the plurality of DNA sequences which satisfy the criteria for the candidate genes identified in the query.
 26. A computer program product stored on a computer-readable storage medium for identifying candidate genes from a plurality of DNA sequences, the computer program product comprising: code for obtaining results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; code for obtaining annotative information for the plurality of DNA sequences, the annotative information comprising information about the biochemical functions and physiological roles of the plurality of DNA sequences; code for obtaining gene expression profile data for the plurality of DNA sequences, the gene expression profile data describing behavioral patterns of the plurality of DNA sequences; code for clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; code for storing the results of the homology search, the annotative information, the gene expression profile data, and results from clustering the plurality of DNA sequences in a database; code for receiving a query identifying criteria for the candidate genes; code for searching the database, in response to the query, to identify a set of DNA sequences from the plurality of DNA sequences which satisfy the query criteria.
 27. A computer program product stored on a computer-readable storage medium for identifying candidate genes, the computer program product comprising: code for configuring a query identifying criteria for the candidate genes; code for communicating the query to a server storing information related to a plurality of DNA sequences, the information comprising: results of a homology search for the plurality of DNA sequences, the homology search results comprising information about homologs of the plurality of DNA sequences; information about the biochemical functions and physiological roles of the plurality of DNA sequences; information describing behavioral patterns of the plurality of DNA sequences; and results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data; and code for receiving from the server, in response to the query, a first set of DNA sequences from the plurality of DNA sequences, wherein the first set of DNA sequences satisfy the criteria for the candidate genes identified in the query.